In this part of the course, we will cover the following concepts:
| Objective | Complete |
|---|---|
| Describe and build univariate plots to illustrate patterns in the data | |
| Discuss and create bivariate and multivariate plots to illustrate patterns in data | |

Univariate plots include: boxplot, histogram, line plot, density curve, dot plot, QQ plot, bar plot
Use the box package and encode your directory structure into variables
Let main_dir be the variable corresponding to your materials folder
The data directory inside the materials folder in your environment corresponds to the data_dir variable
The plots directory corresponds to the plot_dir variable
To build a path, use the paste0 command and pass the strings you would like to paste together
Stroke Dataset: attribute information
Load the dataset from data_dir into R's environment

'data.frame': 5110 obs. of 12 variables:
$ id : int 9046 51676 31112 60182 1665 56669 53882 10434 27419 60491 ...
$ gender : chr "Male" "Female" "Male" "Female" ...
$ age : num 67 61 80 49 79 81 74 69 59 78 ...
$ hypertension : int 0 0 0 0 1 0 1 0 0 0 ...
$ heart_disease : int 1 0 1 0 0 0 1 0 0 0 ...
$ ever_married : chr "Yes" "Yes" "Yes" "Yes" ...
$ work_type : chr "Private" "Self-employed" "Private" "Private" ...
$ Residence_type : chr "Urban" "Rural" "Rural" "Urban" ...
$ avg_glucose_level: num 229 202 106 171 174 ...
$ bmi : num 36.6 NA 32.5 34.4 24 29 27.4 22.8 NA 24.2 ...
$ smoking_status : chr "formerly smoked" "never smoked" "never smoked" "smokes" ...
$ stroke : int 1 1 1 1 1 1 1 1 1 1 ...
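
The directory setup and loading step described above can be sketched roughly as follows. This is only an illustration: the folder location `~/materials` and the file name `stroke_data.csv` are assumptions that the slides do not state, and the sketch uses base R only rather than the box package.

```r
# Hypothetical folder layout: <main_dir>/data and <main_dir>/plots.
main_dir <- "~/materials"              # assumed location; adjust to your setup
data_dir <- paste0(main_dir, "/data")  # paste0 joins strings with no separator
plot_dir <- paste0(main_dir, "/plots")

# Assumed file name; the slides do not state it.
data_file <- paste0(data_dir, "/stroke_data.csv")

# Load only if the file is present, so the sketch runs anywhere.
if (file.exists(data_file)) {
  health_data <- read.csv(data_file, stringsAsFactors = FALSE)
  str(health_data)
}
```

Keeping the paths in variables means a later call such as `paste0(plot_dir, "/boxplot.png")` does not need the full path repeated.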
Replace the missing values in the bmi column with the mean

# Convert BMI to numeric
health_data$bmi <- as.numeric(health_data$bmi)
# Replace N/A's in BMI column with mean
health_data$bmi[is.na(health_data$bmi)] <- mean(health_data$bmi, na.rm = TRUE)

'data.frame': 5110 obs. of 12 variables:
$ id : int 9046 51676 31112 60182 1665 56669 53882 10434 27419 60491 ...
$ gender : chr "Male" "Female" "Male" "Female" ...
$ age : num 67 61 80 49 79 81 74 69 59 78 ...
$ hypertension : int 0 0 0 0 1 0 1 0 0 0 ...
$ heart_disease : int 1 0 1 0 0 0 1 0 0 0 ...
$ ever_married : chr "Yes" "Yes" "Yes" "Yes" ...
$ work_type : chr "Private" "Self-employed" "Private" "Private" ...
$ Residence_type : chr "Urban" "Rural" "Rural" "Urban" ...
$ avg_glucose_level: num 229 202 106 171 174 ...
$ bmi : num 36.6 28.9 32.5 34.4 24 ...
$ smoking_status : chr "formerly smoked" "never smoked" "never smoked" "smokes" ...
$ stroke : int 1 1 1 1 1 1 1 1 1 1 ...
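
To see the imputation step in isolation, here is a small sketch on a made-up vector. The values and the name `bmi_raw` are illustrative only; the text entry `"N/A"` is an assumption about how a missing value might appear in a raw file.

```r
# Toy stand-in for a raw bmi column; "N/A" mimics a missing entry stored as text.
bmi_raw <- c("36.6", "N/A", "32.5", "34.4")

# as.numeric() turns non-numeric strings into NA (and warns about it).
bmi <- suppressWarnings(as.numeric(bmi_raw))

# Count missing values before imputing.
n_missing <- sum(is.na(bmi))

# Replace each NA with the mean of the observed values.
bmi[is.na(bmi)] <- mean(bmi, na.rm = TRUE)
bmi
```

Mean imputation keeps the column mean unchanged but shrinks its spread, which is worth remembering when reading the plots that follow.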
# Line plot.
plot(col_vector, type, col, xlab, ylab)
col_vector is the vector of the column with the numeric values
type indicates the type of plot and is set to one of:
"p" to create a plot with points
"l" to create a plot with a single line
"o" to create a line chart with a combination of both
col denotes the color of the lines and points
xlab and ylab denote the labels for the x-axis and y-axis respectively
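
A minimal sketch of the call described above, using a short made-up vector in place of a real column (the values and the color are illustrative assumptions):

```r
# Made-up numeric vector standing in for a column such as bmi.
col_vector <- c(36.6, 28.9, 32.5, 34.4, 24.0, 29.0, 27.4, 22.8)

plot(col_vector,
     type = "o",          # points joined by a line
     col  = "steelblue",  # color of the line and points
     xlab = "index",      # x-axis label
     ylab = "bmi")        # y-axis label
```

Changing `type` to `"p"` or `"l"` shows only the points or only the line.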
Plot the bmi column to observe the trend in it
Box plot: bmi
The summary of bmi reveals the parameters we will add to a plot
Min (10.30): bottom whisker of the boxplot
Max (97.60): circle (it's an outlier!)
1st Qu. (23.80): bottom of the box
Median (28.40): line in the box
3rd Qu. (32.80): top of the box

# Set random seed to get the same sample every time!
set.seed(1)
# We have as many variables as columns in our data.
# Save number of columns to a variable using `ncol` function.
n_cols = ncol(health_data)
# Pick a sample of colors for each
# variable in our data set
col_sample = sample(colors(), #<- vector of colors
                    n_cols)   #<- n elements to sample
col_sample
[1] "dodgerblue1" "orchid1" "mediumpurple4" "grey38"
[5] "grey9" "gray34" "grey46" "slateblue3"
[9] "grey16" "olivedrab1" "grey69" "skyblue2"
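
To connect the summary values listed earlier with what a box plot actually draws, the sketch below compares `summary()` with `boxplot.stats()` on a made-up vector. The data are illustrative only. Note that `boxplot.stats()` uses hinges, which can differ slightly from the quartiles that `summary()` reports.

```r
# Made-up values with one clear outlier (60) for illustration.
x <- c(18, 21, 23, 24, 26, 28, 29, 31, 33, 60)

# Min, quartiles, median, mean, and max.
summary(x)

# boxplot.stats() returns what the plot draws:
#   $stats = lower whisker, lower hinge, median, upper hinge, upper whisker
#   $out   = points drawn individually as outliers
bp <- boxplot.stats(x)
bp$stats
bp$out

boxplot(x, col = "orchid1", main = "Toy box plot")
```

The upper whisker stops at the largest value within 1.5 times the box height above the box, so the maximum appears as a separate circle rather than as the whisker end.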
# Make a box plot of relevant variables and give a vector of colors for each of them.
boxplot(health_data$age, health_data$avg_glucose_level, health_data$bmi, col = col_sample)

$breaks
[1] 40 60 80 100 120 140 160 180 200 220 240 260 280
$counts
[1] 220 1312 1599 860 298 153 85 149 229 150 46 9
$density
[1] 2.152642e-03 1.283757e-02 1.564579e-02 8.414873e-03 2.915851e-03
[6] 1.497065e-03 8.317025e-04 1.457926e-03 2.240705e-03 1.467710e-03
[11] 4.500978e-04 8.806262e-05
$mids
[1] 50 70 90 110 130 150 170 190 210 230 250 270
$xname
[1] "health_data$avg_glucose_level"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
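
The components of a histogram object are related by simple arithmetic: each density is the bin count divided by the total count times the bin width. In the output above, the first bin holds 220 of 5110 observations over a width of 20, and 220 / (5110 × 20) ≈ 0.00215, which matches the first `$density` value. The sketch below checks the same relationship on made-up data.

```r
# Made-up sample; any numeric vector works here.
set.seed(1)
x <- rnorm(200, mean = 100, sd = 20)

# plot = FALSE returns the histogram object without drawing it.
h <- hist(x, plot = FALSE)

# Bin widths come from consecutive breaks.
widths <- diff(h$breaks)

# density = counts / (n * bin width), so the bar areas sum to 1.
manual_density <- h$counts / (sum(h$counts) * widths)
all.equal(manual_density, h$density)
sum(h$density * widths)
```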
# Make a combined histogram plot.
par( #<- set plot area parameters with `par`
mfrow = c(1, 2)) #<- split area into 1 row 2 columns with `mfrow`
hist(health_data$age, #<- add 1st histogram
col = col_sample[2],
xlab = "age",
main = "Dist. of age")
hist(health_data$avg_glucose_level, #<- add 2nd histogram
col = col_sample[3],
xlab = "avg_glucose_level",
main = "Dist. of avg_glucose_level")

# Bar chart.
barplot(input, xlab, ylab, main, names.arg, col)
It includes:
input represents the vector or matrix containing the numeric values
xlab and ylab denote the labels for the x-axis and y-axis respectively
main denotes the title of the chart
names.arg represents the vector of names that appear under each bar
col denotes the color of the bars

par( #<- set plot area parameters with `par`
mfrow = c(1, 1))
# Create the data for the chart
input <- c(80, 95, 90, 85, 75)
time_i <- c("6AM", "9AM", "12PM", "3PM", "6PM")
# Plotting the bar chart
barplot(input, names.arg = time_i, xlab = "Time intervals", ylab = "pulse rate of sample patient", col = "blue")

| Objective | Complete |
|---|---|
| Describe and build univariate plots to illustrate patterns in the data | ✔ |
| Discuss and create bivariate and multivariate plots to illustrate patterns in data | |

A bivariate plot such as a scatterplot helps reveal:
form of the relationship
strength of the relationship
dependence of the relationship on external circumstances

plot(health_data$age,               #<- variable for x-axis
     health_data$avg_glucose_level) #<- variable for y-axis

plot(health_data$age,
     health_data$avg_glucose_level,
     xlab = "age",
     ylab = "avg_glucose_level",
     main = "age vs avg_glucose_level")

Customize the point symbol by choosing a pch option
Multivariate plots include: scatterplot matrix, correlation plot, parallel coordinates plot, various compound plots using multiple juxtaposed variables
A scatterplot matrix allows us to glance at relationships between multiple variables quickly
To see the scatterplot for variables 1 and 2, select the tile in row 1 and column 2
To create this plot, use the pairs function to pass a range of variables
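
As a rough sketch of the `pairs` call described above, the example below builds a small made-up data frame. The values are simulated; only the column names mirror the numeric columns of the stroke data.

```r
# Made-up data frame with three numeric columns.
set.seed(1)
toy <- data.frame(age = runif(50, 20, 80))
toy$avg_glucose_level <- 70 + 0.8 * toy$age + rnorm(50, sd = 15)
toy$bmi <- 22 + 0.1 * toy$age + rnorm(50, sd = 4)

# Pass a range of columns; the tile in row 1, column 2 plots
# avg_glucose_level on the x-axis against age on the y-axis.
pairs(toy[, 1:3], main = "Scatterplot matrix (toy data)")
```

Each pair of variables appears twice, once above and once below the diagonal, with the axes swapped.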
corrplot package
Use the corrplot package to create these visualizations

# Install package through command line.
#install.packages("corrplot")
# Load the package into the environment.
library(corrplot)
corrplot 0.88 loaded
The size of the circle represents the absolute value of the correlation coefficient: \(0 \le |cor| \le 1\)
The color of the circle represents the sign (i.e., positive vs. negative)
From this plot we notice that:
age & bmi have a relatively higher correlation coefficient of about 0.33
bmi & avg_glucose_level have a relatively lower coefficient of just over 0.17
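
The correlation plot is drawn from a correlation matrix, which base R computes with `cor`. The sketch below uses simulated data, so its coefficients will not match the values quoted above, and it draws the plot only if the corrplot package happens to be installed.

```r
# Made-up numeric data; replace with the numeric columns of your own data frame.
set.seed(1)
toy <- data.frame(age = runif(100, 20, 80))
toy$bmi <- 22 + 0.1 * toy$age + rnorm(100, sd = 4)
toy$avg_glucose_level <- 90 + 0.3 * toy$age + rnorm(100, sd = 30)

# Pairwise Pearson correlations; the result is symmetric with 1s on the diagonal.
cor_matrix <- cor(toy)
round(cor_matrix, 2)

# Draw the circle plot only if corrplot is installed.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_matrix, method = "circle")
}
```

Non-numeric columns such as gender must be dropped or encoded before calling `cor`, since it accepts numeric input only.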
You are now ready to try tasks 1-8 in the exercise for this topic

In this part of the course, we have covered the following concepts:
| Objective | Complete |
|---|---|
| Describe and build univariate plots to illustrate patterns in data | ✔ |
| Discuss and create bivariate and multivariate plots to illustrate patterns in data | ✔ |